AN INDEX STRUCTURE FOR SUPPORTING STRUCTURAL XML QUERIES

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to databases and, more particularly, to an index 
structure for searching XML documents. 

2. Description of the Related Art 

XML provides a flexible way to define semi-structured data. For instance, 
purchase records that contain information of buyers and sellers can be described by the 
document type definition (hereinafter referred to as "DTD") schema shown in Figure 1.
DTD is a common schema specification method for XML documents. A sample XML 
document based on this DTD is shown in Figure 3. 

The ability to express complex structural or graphical queries is one of the major 
focuses in XML query language design. In Figure 2, four sample queries in graph form 
are shown. It is well-known in the art that querying XML data is equivalent to finding 
sub-structures of the data graph that match the query structure. 

Many of the current approaches to querying XML data create indexes on paths
(e.g., "/P/S/I/M" as in Q1) or nodes in DTD trees. Path indexes can answer simple
queries such as Q1 efficiently. However, queries involving branching structures (Q2, for
instance) usually have to be disassembled into multiple sub-queries, each sub-query
corresponding to a single path in the graph. The results of these sub-queries are then
combined by expensive "join" operations to produce final answers. For the same reason,
these methods are also inefficient in handling '*' or '//' queries (Q3 and Q4, for instance),
which, too, correspond to multiple paths. To avoid expensive join operations, some index
methods create special index entries for frequently occurring multiple-path queries
(commonly referred to as "refined paths"). The potential disadvantages of this approach
include: 1) there is a need to monitor query patterns; 2) it is not a general approach
because not every branching query is optimized; and 3) the number of refined paths can
have a huge impact on the size and the maintenance cost of the index.

Moreover, to retrieve semi-structured data (e.g., XML documents) efficiently, it is
essential to index on both structure and content of the XML data. Nevertheless, many
algorithms index on structure only, or index on structure and content separately, which
means, for instance, attribute values in Q2, Q3, and Q4 are not used for filtering in the
most effective way.

Another important aspect to XML indexing is whether the index structure
supports dynamic data insertion, deletion, and update, and whether the index depends on
specialized data structures not well-supported by database systems.

SUMMARY OF THE INVENTION 
In one aspect of the present invention, a method of generating a virtual suffix tree
(ViST) structure for searching XML documents is provided. The method comprises
receiving one or more XML documents; and converting the one or more XML documents into
one or more structure-encoded sequences. The method further comprises generating the
ViST structure comprising: generating a D-Ancestor index; generating an S-Ancestor
index; and generating a doc-ID index.

In a second aspect of the present invention, a method of answering an XML query
is provided. The method comprises receiving an XML query; transforming the XML
query into a structure-encoded sequence; and searching a ViST structure using the
structure-encoded sequence and returning one or more document IDs.

In a third aspect of the present invention, a method of dynamically updating the
ViST structure is provided. The method comprises receiving a new XML document;
transforming the XML document into a structure-encoded sequence; inserting each
element of the sequence into the D-Ancestor B+Tree; assigning a new label if the step of
inserting creates a new node; and inserting the new label into the S-Ancestor B+Tree.

In a fourth aspect of the present invention, a machine-readable medium having
instructions stored thereon for execution by a processor to perform a method of
generating a virtual suffix tree (ViST) structure for searching XML documents is
provided. The method comprises receiving one or more XML documents; and converting the
one or more XML documents into one or more structure-encoded sequences. The method
further comprises generating the ViST structure comprising: generating a D-Ancestor
index; generating an S-Ancestor index; and generating a doc-ID index.

In a fifth aspect of the present invention, a machine-readable medium having
instructions stored thereon for execution by a processor to perform a method of answering
an XML query is provided. The method comprises receiving an XML query;
transforming the XML query into a structure-encoded sequence; and searching a ViST
structure using the structure-encoded sequence and returning one or more document IDs.

In a sixth aspect of the present invention, a machine-readable medium having
instructions stored thereon for execution by a processor to perform a method of
dynamically updating the ViST structure is provided. The method comprises receiving a
new XML document; transforming the XML document into a structure-encoded
sequence; inserting each element of the sequence into the D-Ancestor B+Tree; assigning a
new label if the step of inserting creates a new node; and inserting the new label into the
S-Ancestor B+Tree.



BRIEF DESCRIPTION OF THE DRAWINGS 
The invention may be understood by reference to the following description taken
in conjunction with the accompanying drawings, in which like reference numerals
identify like elements, and in which:

Figure 1 shows a document type definition schema;

Figure 2 shows four sample XML queries;

Figure 3 shows a table of structure-encoded sequences, in accordance with one
embodiment of the present invention;

Figure 4 shows a structure-encoded sequence, in accordance with one
embodiment of the present invention;

Figure 5 shows using a suffix-tree-like structure to index structure-encoded
sequences for non-contiguous matching, in accordance with one embodiment of the
present invention;

Figure 6 shows an index structure of RIST, in accordance with one embodiment
of the present invention;

Figure 7 shows an XML schema, in accordance with one embodiment of the
present invention;

Figure 8 shows a dynamic range allocation, in accordance with one embodiment
of the present invention;

Figure 9(a) shows an index prior to insertion, in accordance with one embodiment
of the present invention; and

Figure 9(b) shows an index after insertion, in accordance with one embodiment of
the present invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Illustrative embodiments of the invention are described below. In the interest of
clarity, not all features of an actual implementation are described in this specification. It
will of course be appreciated that in the development of any such actual embodiment,
numerous implementation-specific decisions must be made to achieve the developers'
specific goals, such as compliance with system-related and business-related constraints,
which will vary from one implementation to another. Moreover, it will be appreciated
that such a development effort might be complex and time-consuming, but would
nevertheless be a routine undertaking for those of ordinary skill in the art having the
benefit of this disclosure.

While the invention is susceptible to various modifications and alternative forms,
specific embodiments thereof have been shown by way of example in the drawings and
are herein described in detail. It should be understood, however, that the description
herein of specific embodiments is not intended to limit the invention to the particular
forms disclosed, but on the contrary, the intention is to cover all modifications,
equivalents, and alternatives falling within the spirit and scope of the invention as defined
by the appended claims.

It is to be understood that the systems and methods described herein may be
implemented in various forms of hardware, software, firmware, special purpose
processors, or a combination thereof. In particular, the present invention is preferably
implemented as an application comprising program instructions that are tangibly
embodied on one or more program storage devices (e.g., hard disk, magnetic floppy disk,
RAM, ROM, CD ROM, etc.) and executable by any device or machine comprising
suitable architecture, such as a general purpose digital computer having a processor,
memory, and input/output interfaces. It is to be further understood that, because some of
the constituent system components and process steps depicted in the accompanying
Figures are preferably implemented in software, the connections between system modules
(or the logic flow of method steps) may differ depending upon the manner in which the
present invention is programmed. Given the teachings herein, one of ordinary skill in the
related art will be able to contemplate these and similar implementations of the present
invention.

With the growing importance of XML in data exchange, much research has been
done in providing flexible query facilities to extract data from structured XML
documents. The present disclosure presents ViST (or "virtual suffix tree"), which is a
novel index structure for searching XML documents. By representing both XML
documents and XML queries in structure-encoded sequences, it is shown that querying
XML data is equivalent to finding (non-contiguous) subsequence matches. A variety of
XML queries, including those with branches or wild-cards ('*' and '//'), can be
expressed by structure-encoded sequences. Unlike index methods that disassemble a
query into multiple sub-queries, and then join the results of these sub-queries to provide
the final answers, ViST uses tree structures as the basic unit of query to avoid expensive
join operations. Furthermore, ViST provides a unified index on both content and
structure of the XML documents; hence, it has a performance advantage over methods
indexing either just content or structure. ViST supports dynamic index update, and it relies
solely on B+Trees without using any specialized data structures that are not well
supported by common database management systems (hereinafter referred to
as "DBMSs").

The ViST structure comprises two parts, the "D-Ancestor index" and the 
"S-Ancestor index." The D-Ancestor index indexes nodes by their ancestor-descendant 
relationships in the original XML document tree. The S-Ancestor index indexes nodes by 
their ancestor-descendant relationships in the virtual suffix tree. By combining the two 
parts, structural XML queries can be answered in a way similar to substring matching 
using suffix trees. 

ViST also answers challenges in index structure design. Unlike many previous 
methods that index either just structure or content of the XML data, ViST unifies 
structural indexes and value indexes into a single index. In addition, a technique called 
dynamic virtual suffix tree labeling is proposed, in which structural XML queries, as well 
as dynamic index update, can be performed directly on B+Trees, instead of relying on
specialized data structures such as suffix trees that are not well supported by DBMSs. 

It is believed that ViST is the first approach that provides all of the following 
features at the same time. The list below is not intended to be, and should not be
construed as, an exhaustive list.

(1) Unlike most indexing methods that disassemble a structured query into
multiple sub-queries, and then join the results of these sub-queries to
provide the final answers, ViST uses tree structures as the basic unit of
query to avoid expensive join operations.

(2) ViST provides a unified index on both the content and the structure of
XML documents; hence, it has a performance advantage over methods
indexing either just content or structure.

(3) Unlike some XML indexing approaches that rely on specialized data
structures such as the suffix tree, which is not well-supported for
disk-based data, ViST relies on the mature disk-based B+Tree index.

(4) ViST supports dynamic data insertion and deletion.

Most known XML indexing algorithms rely on specialized data structures, for
example, suffix trees, path trees, etc. These structures are not well-supported in
commercial DBMSs. Thus, in order to support such an XML index, the DBMSs must
implement these specialized data structures. This is not an easy task,
however, because of database issues such as concurrency control,
locking, etc. The B+Tree, on the other hand, already addresses these issues.

Structure-encoded sequences, which are sequential representations of both XML
data and XML queries, will now be presented. It will be shown that querying XML is
equivalent to finding subsequence matches.

The purpose of modeling XML queries through sequence matching is to avoid as
many unnecessary join operations as possible in query processing. That is,
structure-encoded sequences are used, instead of nodes or paths (as is commonly used), as
the basic unit of query. Through sequence matching, structured queries are matched
against structured data as a whole, without breaking down the queries into sub-queries of
paths or nodes and relying on join operations to combine their results. Several common
XML databases (e.g., Digital Bibliography & Library Project, Internet movie database,
etc.) contain a large set of records of the same structure. Other XML databases may not
be as homogeneous. A synthetic XMARK dataset consists of one (generally large)
record. However, each sub-structure in XMARK's schema (e.g., items, closed auction,
open auction, person, etc.) contains a large number of instances in the database.
Therefore, each sub-structure should have an index of its own. The sequence matching
approach described herein ensures that queries confined within the same structure are
matched as a whole.

Consider the XML purchase record shown in Figure 3. Capital letters represent
names of elements/attributes, and a hash function, h(), encodes attribute values into
integers. Suppose, for instance, v1 = h("dell") and v2 = h("ibm"). Values v1 and v2
represent "dell" and "ibm," respectively.

An XML document is represented by the preorder sequence of its tree structure. 
For the purchase record example of Figure 3, its preorder sequence is shown below. 

P S N v1 I M v2 N v3 I M v4 I N v5 L v6 B L v7 N v8

Because isomorphic trees may produce different preorder sequences, an order 
among sibling nodes is enforced. The DTD schema embodies a linear order of all 
elements/attributes defined therein. If the DTD is not available, the lexicographical order 
of the names of the elements/attributes is used. For example, under lexicographical order, 
the Buyer node will precede the Seller node under Purchase. Multiple occurring child 
nodes (such as the Item nodes under Seller) are ordered arbitrarily. As shown below, 
branching queries generally require special handling when multiple occurring child nodes
are involved.

To reconstruct trees from preorder sequences, extra information is needed. A
structure-encoded sequence, as defined herein, is a two dimensional sequence, where the
second dimension preserves the structure of the data. The structure-encoded sequence is
derived from a prefix traversal of a semi-structured XML document. The
structure-encoded sequence is a sequence of (symbol, prefix) pairs:

D = (a_1, p_1)(a_2, p_2) ... (a_n, p_n)

where a_i represents a node in the XML document tree (of which a_1, ..., a_n is the preorder
sequence), and p_i is the path from the root node to node a_i.
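
By way of illustration only, the conversion defined above can be sketched in Python as follows. The sketch is not part of the disclosed embodiment: the function name encode, the use of tag initials as one-letter symbols, the storage of raw attribute values in place of h(), and the sample document are all assumptions made for the example. The empty string stands for the empty prefix ε, and child nodes are visited in document order, which is assumed to agree with the enforced sibling order.

```python
import xml.etree.ElementTree as ET

def encode(elem, prefix=""):
    """Return the structure-encoded sequence of elem as (symbol, prefix) pairs."""
    symbol = elem.tag[0].upper()      # assumption: symbol = first letter of the tag
    pairs = [(symbol, prefix)]        # the node itself, with its root-to-parent path
    path = prefix + symbol
    text = (elem.text or "").strip()
    if text:
        pairs.append((text, path))    # value leaf; hashing with h() is omitted here
    for child in elem:                # preorder traversal in document order
        pairs.extend(encode(child, path))
    return pairs

# Hypothetical purchase record, much smaller than the one in the Figures.
doc = ET.fromstring(
    "<Purchase><Seller><Name>dell</Name><Loc>boston</Loc></Seller>"
    "<Buyer><Loc>newyork</Loc></Buyer></Purchase>"
)
seq = encode(doc)
for pair in seq:
    print(pair)
```

Each printed pair carries the symbol and the path from the root to its parent, so the tree can be rebuilt from the sequence alone.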

Based on this definition of the structure-encoded sequence, the XML purchase
record of Figure 3 can be converted to the structure-encoded sequence in Figure 4. As
shown in Figure 4, the underlined non-contiguous subsequence of D matches query Q2
shown below in Table 2. The prefixes in the sequential representation contain much
redundant information. However, because duplicate (symbol, prefix) pairs are not stored
in the index and the prefixes can be encoded easily (as shown below), the prefixes will
not create problems in index size or storage.

Path Expression                                      Structure-Encoded Sequence

Q1: /Purchase/Seller/Item/Manufacturer               (P,ε)(S,P)(I,PS)(M,PSI)
Q2: /Purchase[Seller[Loc = v5]]/Buyer[Loc = v7]      (P,ε)(S,P)(L,PS)(v5,PSL)(B,P)(L,PB)(v7,PBL)
Q3: /Purchase/*[Loc = v5]                            (P,ε)(L,P*)(v5,P*L)
Q4: /Purchase//Item[Manufacturer = v3]               (P,ε)(I,P//)(M,P//I)(v3,P//IM)

Table 2: XML Queries in Path Expression and Sequence Form

In the same spirit, XML queries can be converted into structure-encoded
sequences. The queries in Figure 2 can be transformed to the structure-encoded sequences
in Table 2. The following rules are observed in the conversion:

(1) Just like converting XML data, preorder sequences are used to represent
queries. (Example: Q1, Q2).

(2) Wild-card nodes ('*' and '//') are discarded. However, the prefix paths of
their sub nodes will contain a '*' or '//' symbol as a place holder. As
shown below, '*' and '//' are handled as range queries by ViST in
sequence matching. (Example: Q3, Q4)

The purpose of introducing structure-encoded sequences is to model XML queries
through sequence matching. In other words, querying XML is equivalent to finding
(non-contiguous) subsequence matches. This is shown by queries Q1, ..., Q4 in Table 2 above.

The structure-encoded sequence of Q1 is a subsequence of D, and Q1 is a sub tree of the
XML purchase record that D represents. The sequence of Q2 is a non-contiguous
subsequence of D, and again, Q2 is a sub tree of the XML purchase record. The same can
be said for queries Q3 and Q4, where prefix paths contain wild-cards '*' and '//', if '*' is
matched with any single symbol in the path, and '//' is matched with any portion of the
path.

The obvious benefit of modeling XML queries through sequence matching is that
structural queries can be processed as a whole instead of being broken down into smaller
query units (e.g., paths or nodes of XML document trees), as combining the results of the
sub-queries by join operations is often expensive. In other words, structures are used as
the basic unit of query.

Most structural XML queries can be performed through direct subsequence
matching. The only exception occurs when a branch has multiple identical child nodes.
For instance, in Q5 = /A[B/C]/B/D, the two nodes under the branch are the same: B. In
this case, the tree isomorphism problem cannot be avoided by enforcing sibling orders
because the two nodes are identical. As a result, the preorder sequences of XML data
trees that contain such a branch can have two possible forms. In order to find all matches,
Q5 is converted to two different sequences, namely, (A,ε)(B,A)(C,AB)(B,A)(D,AB) and
(A,ε)(B,A)(D,AB)(B,A)(C,AB). Matches for these two sequences are separately found
and their results are combined in a "union" operation. On the other hand, false matches
may be found if the indexed documents contain branches with identical child nodes.
In that case, multiple queries are asked and a "set difference" on their results is computed. If, in the
unlikely case, the query contains a large number of same child nodes under a branch, the
tree can be disassembled at the branch into multiple trees. Join operations can be used to
combine their results. For instance, Q5 can be disassembled into two trees:
(A,ε)(B,A)(C,AB) and (A,ε)(B,A)(D,AB). It should be noted that Q5 is a special case
where each split tree is a single path.

After both XML data and XML queries are converted to structure-encoded 
sequences, it is straightforward to one of ordinary skill in the art to devise a brute force 
algorithm to perform (non-contiguous) sequence matching. The rest of this disclosure 
will describe building a dynamic index structure so that such matches can be found 
efficiently. 
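
For illustration, one such brute force matcher may be sketched in Python as shown below. This is a minimal sketch under stated assumptions, not the disclosed index: the function name is_subsequence and the sample data are hypothetical, the empty string stands for the empty prefix ε, and wild-cards are not handled (each query pair must equal a data pair exactly).

```python
def is_subsequence(query, data):
    """True if the query pairs occur in data in the same order, gaps allowed."""
    it = iter(data)
    # "pair in it" advances the iterator until the pair is found, so each
    # later query pair is only searched for after the previous match.
    return all(pair in it for pair in query)

# Hypothetical structure-encoded sequence of one document.
data = [("P", ""), ("S", "P"), ("N", "PS"), ("dell", "PSN"),
        ("L", "PS"), ("boston", "PSL"),
        ("B", "P"), ("L", "PB"), ("newyork", "PBL")]

# A branching query: seller located in boston and buyer located in newyork.
branching = [("P", ""), ("S", "P"), ("L", "PS"), ("boston", "PSL"),
             ("B", "P"), ("L", "PB"), ("newyork", "PBL")]
# The same symbols in an order that does not occur in the data.
reordered = [("P", ""), ("B", "P"), ("S", "P")]

print(is_subsequence(branching, data))
print(is_subsequence(reordered, data))
```

The matcher scans every document sequence from start to finish for every query, which is the cost an index structure is meant to avoid.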

ViST will now be presented in three stages. First, a naive algorithm is presented.
The naive algorithm, based entirely on suffix trees, requires traversal of a large portion of
the tree structure for non-contiguous subsequence matching. Second, the relationships
indexed suffix tree (hereinafter referred to as "RIST") is presented. RIST improves the
naive algorithm by using B+Trees to index suffix tree nodes. Finally, ViST is presented.
ViST is an index structure having the same functionality as RIST but relies exclusively
on B+Trees.

The desiderata of an XML indexing method include: 

1. The index method should support structural queries directly. With
structure-encoded sequences, this requirement is equivalent to having
efficient support for (non-contiguous) subsequence matching.

2. Instead of relying on specialized data structures such as suffix trees, the
index method should leverage well-supported database indexing
techniques such as B+Trees.

3. The index structure should allow dynamic data insertion, deletion, etc.

Figure 5 shows an example of using a suffix-tree-like structure to index
structure-encoded sequences for non-contiguous matching. Two sequences, Doc1 and
Doc2, are inserted into the suffix tree. Originally, the elements in the sequences represent
nodes in the XML document trees, from which the sequences are derived. Now, the
elements also represent nodes in the suffix tree. Because the nodes are involved in two
different trees, two kinds of ancestor-descendant relationships among the sequence
elements arise: 1) the ancestor-descendant relationships of the nodes that they represent in
the original XML document tree; and 2) the ancestor-descendant relationships of the
nodes that they represent in the suffix tree. The first relationship is called the
D-Ancestorship. For instance, element (S,P) is a D-Ancestor of (I,PS). The second
relationship is called the S-Ancestorship. For instance, element (v1,PSN) is an
S-Ancestor of (L,PS).

The algorithm below presents a naive method for non-contiguous subsequence
matching:

Input:  Q = q_1, ..., q_k, a query sequence
        S, a suffix tree for a set of sequences
Output: all occurrences of Q in S

/* Search begins at the root of the suffix tree */
NaiveSearch(S.root, 1);

Function NaiveSearch(n, i)
if i <= k then
    for each node c that is a descendant of node n do
        /* n is an S-Ancestor of c */
        if c matches q_i then
            /* n is a D-Ancestor of c */
            NaiveSearch(c, i + 1);
        end
    end
else
    Output all document IDs attached to the nodes under node n;
end
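
The naive algorithm may be illustrated by the following Python sketch. It is offered only as an example under assumptions: the class Node and the functions insert, descendants, and naive_search are hypothetical names, a plain trie of (symbol, prefix) pairs stands in for the suffix-tree-like structure of Figure 5, matching is by exact equality (no wild-cards), and query positions are counted from 0 rather than 1.

```python
class Node:
    def __init__(self):
        self.children = {}   # (symbol, prefix) -> Node
        self.doc_ids = []    # documents whose sequence ends at this node

def insert(root, seq, doc_id):
    node = root
    for pair in seq:
        node = node.children.setdefault(pair, Node())
    node.doc_ids.append(doc_id)

def descendants(node):
    """Yield (pair, node) for every node below the given node."""
    for pair, child in node.children.items():
        yield pair, child
        yield from descendants(child)

def naive_search(node, query, i=0):
    if i == len(query):
        # Whole query matched: collect IDs at this node and everywhere below it.
        found = set(node.doc_ids)
        for _, d in descendants(node):
            found.update(d.doc_ids)
        return found
    found = set()
    for pair, child in descendants(node):   # S-Ancestorship: anywhere below node
        if pair == query[i]:                # D-Ancestorship: same (symbol, prefix)
            found |= naive_search(child, query, i + 1)
    return found

root = Node()
insert(root, [("P", ""), ("S", "P"), ("L", "PS"), ("boston", "PSL")], 1)
insert(root, [("P", ""), ("S", "P"), ("N", "PS"), ("dell", "PSN"),
              ("L", "PS"), ("boston", "PSL")], 2)
insert(root, [("P", ""), ("B", "P"), ("L", "PB"), ("newyork", "PBL")], 3)

boston_docs = naive_search(root, [("P", ""), ("L", "PS"), ("boston", "PSL")])
buyer_docs = naive_search(root, [("P", ""), ("B", "P")])
print(boston_docs, buyer_docs)
```

Every call walks the whole subtree under the current node, which makes the traversal cost discussed in the text directly visible.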



Suppose node x is one of the nodes matching q_1, ..., q_(i-1). To match the next element q_i,
we check all the nodes under x, which are the nodes satisfying the S-Ancestorship.
Among them, we find those that match q_i's (Symbol, Prefix) pair, which are the nodes
satisfying the D-Ancestorship, as Prefix encodes D-Ancestorship in the XML document
tree. For example, to match Q3, we start with the root node, which matches the first
element of Q3, (P,ε). Then, we search under the root for all nodes that match (L,P*),
which are (L,PS) and (L,PB). Finally, we search for (v5,PSL) (the wildcard in the query is
instantiated to 'S' by the previous match) under the node labeled (L,PS), and (v5,PBL)
under the node labeled (L,PB).

In essence, the algorithm above searches nodes first by S-Ancestorship (searching
under a suffix tree node), and then D-Ancestorship (matching nodes by symbols and
prefixes). The algorithm supports structural query. However, there are several
difficulties in using a suffix tree to index structure-encoded sequences. First, searching
for nodes satisfying both S-Ancestorship and D-Ancestorship is extremely costly because
we need to traverse a large portion of the subtree for each match. Second, suffix trees are
main memory structures that are seldom used for disk resident data, and most commercial
DBMSs do not have support for such structures.

RIST improves the naive algorithm by eliminating costly suffix tree traversal.
With RIST, when we reach a node X after matching a prefix of the query, we can "jump"
directly to those nodes Y to which X is both a D-Ancestor and an S-Ancestor. Thus, we
no longer need to search among the descendants of X to find such Y's one by one. More
specifically, RIST is designed as follows:

1. We index nodes in the suffix tree by their (Symbol, Prefix) pairs. This is
realized by a B+Tree. It enables us to search nodes by (Symbol, Prefix),
that is, by D-Ancestorship, because Prefix encodes ancestor-descendant
relationships in the XML document tree. We call this B+Tree the
D-Ancestorship B+Tree.

2. Among all nodes satisfying D-Ancestorship, we are interested in those
satisfying S-Ancestorship as well. We create labels for suffix tree nodes
so that we can tell S-Ancestorship between two nodes by their labels. We
use B+Trees to index nodes by labels. Such B+Trees are known as
S-Ancestorship B+Trees.

We determine the D-Ancestorship between two elements by checking their prefixes.
However, determining S-Ancestorship between two elements requires additional
information. We label each suffix tree node x by a pair (n_x, size_x), where n_x is the prefix
traversal order of x in the suffix tree, and size_x is the total number of descendants of x in
the suffix tree. Labeling can be accomplished by making a depth-first traversal of the
suffix tree. An example of such labeling is shown in Figure 5. With the labeling, the
S-Ancestorship between any two nodes can be decided easily: if x and y are labeled
(n_x, size_x) and (n_y, size_y) respectively, node x is an S-Ancestor of node y if and only if
n_y ∈ (n_x, n_x + size_x].
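
The labeling and the resulting interval test can be sketched, for illustration only, in Python as follows. The function names assign_labels and is_s_ancestor, the adjacency-list representation of the tree, and the small sample tree are assumptions made for the example; they do not appear in the Figures.

```python
def assign_labels(tree, root):
    """tree maps a node name to its list of children; returns name -> (n, size)."""
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        n = counter                  # preorder (prefix traversal) number
        counter += 1
        # size = number of descendants = each child plus that child's descendants
        size = sum(visit(child) + 1 for child in tree.get(node, []))
        labels[node] = (n, size)
        return size

    visit(root)
    return labels

def is_s_ancestor(x_label, y_label):
    """x is an S-Ancestor of y iff n_y lies in (n_x, n_x + size_x]."""
    (n_x, size_x), (n_y, _) = x_label, y_label
    return n_x < n_y <= n_x + size_x

# Hypothetical tree: root has children a and d; a has children b and c.
tree = {"root": ["a", "d"], "a": ["b", "c"]}
labels = assign_labels(tree, "root")
print(labels)
print(is_s_ancestor(labels["a"], labels["c"]))
print(is_s_ancestor(labels["a"], labels["d"]))
```

Because the test is a comparison of integers, it can be answered by a range query on an ordered index rather than by walking the tree.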

To construct the B+Trees, we first insert all suffix tree nodes into the
D-Ancestorship B+Tree using their (Symbol, Prefix) as keys. For all nodes x inserted
with the same (Symbol, Prefix), we index them by an S-Ancestorship B+Tree, using the
n_x values of their labels as keys.

In addition, we also build a DocId B+Tree, which stores, for each node x (using n_x
as the key), the document IDs of those XML sequences that end up at node x when they
are inserted into the suffix tree.

Figure 6 shows the index structure of RIST. In summary, the construction of the
index structure takes three steps: 1) adding all structure-encoded sequences into a suffix
tree; 2) labeling the suffix tree by making a preorder traversal; and 3) for each node
(Symbol, Prefix) labeled (n, size), inserting it into the D-Ancestor B+Tree using (Symbol,
Prefix) as the key, and then the S-Ancestor B+Tree using n as the key.

Suppose node x, labeled with (n_x, size_x), is one of the nodes matching a query
prefix q_1, ..., q_(i-1). To match the next element q_i in the query, we consult the D-Ancestor
B+Tree using q_i as a key. The D-Ancestor B+Tree returns the root of an S-Ancestor
B+Tree. We then issue a range query n_x < n ≤ n_x + size_x on the S-Ancestor B+Tree to find
the descendants of x immediately. For each descendant, we use the same process to
match symbol q_(i+1), until we reach the last element of the query.

If node y is one of the nodes that matches the last element in the query, then the
document IDs associated with y or any descendant node of y are answers to the query.
Based on y's label, say (n_y, size_y), we know y's descendants are in the range of
(n_y, n_y + size_y]. Thus, we perform a range query [n_y, n_y + size_y] on the DocId B+Tree to
retrieve all the document IDs for y and y's descendants.

The algorithm below formalizes the querying process:

Input:  Q = q_1, ..., q_k, a query sequence
        D-Ancestor B+Tree, index of (symbol, prefix) pairs
        S-Ancestor B+Trees, index of (n, size) labels
        DocId B+Tree, mapping between the n values in node labels
        and document IDs
Output: all occurrences of Q in the XML data

Search((0, size), 1);   /* (0, size) is the label of the
                           root node of the suffix tree */

Function Search((n, size), i)
if i <= |Q| then
    T <- retrieve, from the D-Ancestor B+Tree, the S-Ancestor
         B+Tree that represents q_i;
    N <- retrieve from T, the S-Ancestor B+Tree, all nodes with
         range inside (n, n + size];
    for each node c in N do
        Assume c is labeled (n', size');
        Search((n', size'), i + 1);
    end
else
    Perform a range query [n, n + size] on the DocId B+Tree to
    output all document IDs in that range;
end
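
As a hedged illustration of the querying process, the Python sketch below replaces each B+Tree with a sorted in-memory list searched by bisection. This substitution, the function names build_index and search, and the sample sequences are assumptions for the example only; wild-cards are not handled, and query positions are counted from 0.

```python
from bisect import bisect_left
from collections import defaultdict

def build_index(sequences):
    """sequences: {doc_id: [(symbol, prefix), ...]} -> (d_anc, doc_idx, root_label)."""
    trie, ends = {}, defaultdict(list)       # nested dicts stand in for the suffix tree
    for doc_id, seq in sequences.items():
        node = trie
        for pair in seq:
            node = node.setdefault(pair, {})
        ends[id(node)].append(doc_id)        # sequence of doc_id ends at this node
    d_anc = defaultdict(list)                # (symbol, prefix) -> sorted [(n, size)]
    doc_idx = []                             # sorted [(n, doc_id)]
    counter = 0

    def visit(pair, node):
        nonlocal counter
        n = counter
        counter += 1
        size = sum(visit(p, c) + 1 for p, c in node.items())
        if pair is not None:
            d_anc[pair].append((n, size))
        doc_idx.extend((n, d) for d in ends.get(id(node), []))
        return size

    root_size = visit(None, trie)
    for node_labels in d_anc.values():
        node_labels.sort()
    doc_idx.sort()
    return d_anc, doc_idx, (0, root_size)

def search(d_anc, doc_idx, label, query, i=0):
    n, size = label
    if i == len(query):                      # range query [n, n + size] on doc IDs
        lo = bisect_left(doc_idx, (n,))
        hi = bisect_left(doc_idx, (n + size + 1,))
        return {doc for _, doc in doc_idx[lo:hi]}
    node_labels = d_anc.get(query[i], [])    # plays the role of one S-Ancestor B+Tree
    lo = bisect_left(node_labels, (n + 1,))  # range query (n, n + size] on labels
    hi = bisect_left(node_labels, (n + size + 1,))
    found = set()
    for child_label in node_labels[lo:hi]:
        found |= search(d_anc, doc_idx, child_label, query, i + 1)
    return found

sequences = {
    1: [("P", ""), ("S", "P"), ("L", "PS"), ("boston", "PSL")],
    2: [("P", ""), ("S", "P"), ("N", "PS"), ("dell", "PSN"),
        ("L", "PS"), ("boston", "PSL")],
    3: [("P", ""), ("B", "P"), ("L", "PB"), ("newyork", "PBL")],
}
d_anc, doc_idx, root_label = build_index(sequences)
boston_docs = search(d_anc, doc_idx, root_label,
                     [("P", ""), ("L", "PS"), ("boston", "PSL")])
buyer_docs = search(d_anc, doc_idx, root_label, [("P", ""), ("B", "P")])
print(boston_docs, buyer_docs)
```

The trie is used only while building the labels; the search itself touches nothing but the two ordered lists, mirroring the "jump" from a matched node to its qualifying descendants.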



If an element in the query sequence contains a wild-card '*', more than one
S-Ancestor B+Tree may match the element. Let Q = (P,ε)(L,P*)(v2,P*L). To match
(L,P*), we issue a range query to the D-Ancestor B+Tree. The key of the D-Ancestor
B+Tree is ordered first by the Symbol, then by the length of the Prefix, and lastly by the
content of the Prefix. The search then continues on each S-Ancestor B+Tree returned by
the range query. Note that we only need to handle '*' or '//' for elements whose prefixes
end with '*' or '//'. This is because the matching of (L,P*) will instantiate the '*' in (v2,P*L)
to a concrete symbol, which means (v2,P*L) is not considered as a wild-card query.
Queries with wild-card '//' are handled as a series of '*' queries. Thus, the index
supports wild cards '*' and '//' appearing both in the beginning and in the middle of a
query sequence.

In summary, unlike the naive algorithm, RIST does not use suffix trees for
subsequence matching. From any node, instead of searching the entire subtree under the
node, we can "jump" to the sub nodes that match the next element in the query right
away. Thus, RIST supports non-contiguous subsequence matching efficiently. In
comparison with many other indexing approaches that break a query down to pieces and
then join the results, RIST has the advantage of querying tree structures as a whole.

RIST uses a static scheme to label suffix tree nodes, which prevents it from
supporting dynamic insertions. This is because for any node x labeled (n, size), late
insertions can change the number of nodes that appear before x (in the prefix order) as
well as the size of the subtree rooted at x. This means that neither n nor size can be fixed.

The sole purpose of the suffix tree is to provide a labeling mechanism to encode
S-Ancestorships. Suppose a node x is created for element d_i during the insertion of
sequence d_1, ..., d_i, ..., d_k. If we can estimate 1) how many different elements will possibly
follow d_i in future insertions, and 2) the occurrence probability of each of these elements,
then we can label x's child nodes right away, instead of waiting until all sequences are
inserted. It also means: 1) the suffix tree itself is no longer needed because its sole
purpose of providing a labeling mechanism can be accomplished on the fly; and 2) we
can support dynamic data insertion and deletion.

ViST uses a dynamic labeling method to assign labels to suffix tree nodes. Once
assigned, the labels are fixed and will not be affected by subsequent data insertion or
deletion.

We present a dynamic method for labeling suffix tree nodes without building the 
suffix tree. The method relies on rough estimations of the number of attribute values and 
other semantic/statistical information of the XML data. The dynamic scheme presented 
herein is designed to label suffix trees built for structure-encoded sequences derived from 
XML document trees. 

A tree structure defines nested scopes: the scope of a child node is a sub-scope of
its parent node, and the root node has the maximum scope, which covers the scope of every
node. Initially, the suffix tree contains a single node (the root), and we let it cover the entire
scope, [0, Max), where Max is the maximum value that the machine can represent at a
given precision. Max = $2^{64} - 1$ if 8 bytes are used to represent an integer. Alternatively,
16 bytes can be used for Max as large as $2^{128} - 1$.

Semantic and statistical clues of structured XML data can often assist sub-scope
allocation. Figure 7 shows a sample XML schema. We use $p(u \mid x)$ to denote the
probability that node u occurs in an XML document given that node x occurs. For a node v
that can occur multiple times, $p(v \mid x)$ denotes the probability that at least one v occurs
given that x occurs in an XML document.

If x is the parent of u, it is usually not difficult to derive or estimate, from the
semantics of the XML structure or the statistics of a sample dataset, the probability $p(u \mid x)$.
For instance, if each Buyer has a name, then $p(\mathit{Name} \mid \mathit{Buyer}) = 1$. If we know that roughly
10% of the items contain at least one sub-item, then $p(\mathit{SubItem} \mid \mathit{Item}) = 0.1$.
We start with two assumptions: 1) we know the probability $p(u \mid x)$ for all u, where x is
the parent of u; and 2) in XML document trees, sibling nodes occur independently of each
other. As shown below, the second assumption can be relaxed. If node x appears in an
XML document based on the schema in Figure 7, then each of the following symbols can
appear immediately after x in the sequence derived from the document: u, v, w, y, z, and ε
(i.e., empty, meaning that x is the last element). These symbols form the "follow
set" of x.

A formal definition of a follow set is as follows: Given a node x in an XML
schema, we define the follow set of x as a list. That is, $\mathit{follow}(x) = y_1, \ldots, y_k$, where each $y_i$
satisfies the following condition: $x < y_i < y_{i+1}$ (according to prefix traversal order) and the
parent of $y_i$ is either x or an ancestor node of x.
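The definition can be sketched in a few lines. The schema used here is a hypothetical tree, not the schema of Figure 7, and the names parent, order, and follow_set are assumptions made for illustration. The empty symbol ε is left implicit.

```python
def follow_set(parent, order, x):
    """Return follow(x) for a schema tree.

    parent: maps each node name to its parent name (None for the root)
    order:  node names in prefix traversal order
    A node y belongs to follow(x) if it comes after x in prefix order and
    its parent is x or an ancestor of x.
    """
    allowed = {x}
    node = x
    while parent[node] is not None:
        node = parent[node]
        allowed.add(node)
    position = {name: i for i, name in enumerate(order)}
    return [y for y in order
            if position[y] > position[x] and parent[y] in allowed]


# Hypothetical schema: r has children x and y; x has children u and v;
# u has a single child t.
parent = {"r": None, "x": "r", "u": "x", "t": "u", "v": "x", "y": "r"}
order = ["r", "x", "u", "t", "v", "y"]

follow_x = follow_set(parent, order, "x")
follow_u = follow_set(parent, order, "u")
```

Note that t is excluded from follow_x because its parent u is neither x nor an ancestor of x, so t can only appear after u.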

It is straightforward for one of ordinary skill in the art to prove that only symbols in
x's follow set can appear immediately after x. Suppose $\mathit{follow}(x) = y_1, \ldots, y_k$. Based on the
assumption that sub-nodes occur independently, we have:

$$p(y_i \mid x) = p(y_i \mid d), \quad \text{where } d \text{ is the parent of } y_i$$

The equation above is trivial if $d = x$. If $d \neq x$, then based on the definition of the
follow set, d must be an ancestor of x. Therefore, $p(y_i \mid x) = p(y_i \mid x, d)$. Because x and $y_i$
are in different branches under d, it follows from the assumptions presented above that
they occur independently of each other, which means $p(y_i \mid x, d) = p(y_i \mid d)$.

Let $\mathit{follow}(x) = y_1, \ldots, y_k$. The probability that x is followed immediately by $y_1$ is
$p(y_1 \mid x)$. The probability that x is followed immediately by $y_2$ is $(1 - p(y_1 \mid x))\,p(y_2 \mid x)$. The
probability that x is followed immediately by $y_i$ is (hereinafter referred to as the "probability
equation"):

$$P_x(y_i) = p(y_i \mid x) \prod_{k=1}^{i-1} \bigl(1 - p(y_k \mid x)\bigr)$$
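As a sketch of how the probability equation can be evaluated, the function below walks the follow set once and keeps a running product of the probabilities that none of the earlier symbols occurred. The function name and the sample probabilities are hypothetical; exact rational arithmetic is used only to keep the example free of rounding.

```python
from fractions import Fraction


def follow_probabilities(p):
    """Given p[i] = p(y_i | x) in follow-set order, return the list of P_x(y_i).

    P_x(y_i) = p(y_i | x) * product over earlier y_k of (1 - p(y_k | x)).
    """
    result = []
    none_so_far = Fraction(1)  # probability that no earlier symbol occurred
    for p_i in p:
        result.append(none_so_far * p_i)
        none_so_far *= 1 - p_i
    return result


# Hypothetical conditional probabilities for a follow set of three symbols.
P = follow_probabilities([Fraction(1, 2), Fraction(1, 10), Fraction(1)])
```

Because the last symbol in this example occurs with probability 1, the three values sum to 1 and nothing is left over for the empty symbol ε.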

We allocate subscopes for the child nodes in the suffix tree according to this
probability. More formally, if x's scope is $[l, r)$, the size of the subscope assigned to $y_i$,
the i-th symbol in x's follow set, is:

$$S_i = (r - l - 1)\,P_x(y_i) / C \tag{3}$$

where $C = \sum_{y \in \mathit{follow}(x) \setminus \{\varepsilon\}} P_x(y)$ is a normalization factor. No scope is allocated to ε.

In other words, we should assign a subscope $[l_i, r_i) \subset [l, r)$ to $y_i$, where:

$$l_i = l + 1 + (r - l - 1) \sum_{j<i} P_x(y_j) / C \tag{4}$$

$$r_i = l_i + S_i$$
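The allocation can be sketched as follows. The sketch assumes that the running sum in the left bound is divided by the same normalization factor C as the subscope size, so that consecutive subscopes tile the range [l + 1, r) without gaps; the function name and the sample values are hypothetical. Exact fractions are used here, whereas an actual index would presumably round to integer labels.

```python
from fractions import Fraction


def allocate_subscopes(l, r, P):
    """Split [l + 1, r) among follow-set symbols in proportion to P_x(y_i).

    P: list of P_x(y_i), excluding the empty symbol.
    Returns a list of (l_i, r_i) pairs with r_i = l_i + S_i.
    """
    C = sum(P)                 # normalization factor
    usable = r - l - 1         # position l itself is the ID of x
    scopes = []
    cumulative = Fraction(0)   # sum of P_x(y_j) for j < i
    for P_i in P:
        l_i = l + 1 + usable * cumulative / C
        S_i = usable * P_i / C
        scopes.append((l_i, l_i + S_i))
        cumulative += P_i
    return scopes


# Hypothetical parent scope [0, 101) and follow probabilities summing to 1.
scopes = allocate_subscopes(
    0, 101, [Fraction(1, 2), Fraction(1, 20), Fraction(9, 20)])

# Probabilities that do not sum to 1 are rescaled by C.
rescaled = allocate_subscopes(0, 11, [Fraction(2, 5), Fraction(1, 10)])
```

The more likely a symbol is to follow x, the larger the subscope reserved for it, which leaves more room for its own descendants.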

In the following situations, the follow set and the probability equation need to be
adjusted:

1) The same node can occur multiple times under its parent node. Let
$\mathit{follow}(x) = y_1, \ldots, y_k$. If x occurs multiple times under its parent, then x also
appears in $\mathit{follow}(x)$. That is, $\mathit{follow}(x) = y_1, \ldots, x, \ldots, y_k$, where the symbols
before x are the descendants of x. Let $p_n(x \mid d)$ be the probability that an XML
document contains n occurrences of x under d. Then the
probability that the (n-1)-th x is followed immediately by the n-th x is

$$p_n(x \mid d) \prod_{k} \bigl(1 - p(y_k \mid x)\bigr)$$

where, as in the probability equation, the product is taken over the symbols $y_k$ that
precede x in $\mathit{follow}(x)$.

2) Nodes do not occur independently. The probability equation is derived
based on the assumption that nodes occur independently. However, this
may not always be true. Suppose, for instance, that in Figure 7 either u or
v must appear under x, and $p(u \mid x) = p(v \mid x) = 0.8$. We have $\mathit{follow}(x) = u, v$
because, if either u or v must occur, there is no possibility
that any of w, z, or ε can immediately follow x. Thus, we have:

$$P_x(u) = p(u \mid x) = 0.8$$

$$P_x(v) = (1 - p(u \mid x))\,p(v \mid \neg u, x) = 0.2 \times 1 = 0.2$$

Assume we do not have any statistical information about the data or any semantic
knowledge about the schema. All that we can rely on is a rough estimate of the number
of different elements that follow a given element. The best we can do is to assume that each
of these elements occurs at roughly the same rate. This situation usually corresponds to
attribute values. For instance, in a certain dataset, we may roughly estimate the number
of different values for the attribute CountryOfBirth to be 100.

Suppose node x is assigned a scope of $[l, r)$. Node x itself will then take l as its
ID, and the remaining scope $[l + 1, r)$ is available for x's child nodes. Assume the
expected number of child nodes of x is λ. Without knowledge of the occurrence rate
of each child node, we allocate $1/\lambda$ of the remaining scope to x's first inserted child, which
will have a scope of size $(r - l - 1)/\lambda$. We allocate $1/\lambda$ of the remaining scope to x's second
inserted child, which will have a scope of size $(r - l - 1 - (r - l - 1)/\lambda)/\lambda = (r - l - 1)(\lambda - 1)/\lambda^2$. The third
inserted child will use a scope of size $(r - l - 1)(\lambda - 1)^2/\lambda^3$, and so forth.

Figure 8 demonstrates an example of dynamic range allocation with parameter
λ = 2. It shows that the k-th child is allocated a range that is $1/2^k$ of the parent range in
size. As another example, assuming the expected number of sub-nodes of node y is 100,
the ranges of the child nodes that are inserted among the first five occupy 1%,
0.99%, 0.98%, 0.97%, and 0.96% of the parent range, respectively. Evidently, the allocation
method has a bias that favors nodes inserted earlier.

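More formally, according to the above procedure, for a given node x with a range
of $[l, r)$, the size of the subrange assigned to its k-th child is

$$S_k = \frac{1}{\lambda}\Bigl(r - l - 1 - \sum_{j=1}^{k-1} S_j\Bigr) \tag{5}$$

One of ordinary skill in the art can prove that $S_k = (r - l - 1)(\lambda - 1)^{k-1}/\lambda^k$. In other words, we
should assign a subrange $[l_k, r_k) \subset [l, r)$ to the k-th child of node x, where:

$$l_k = l + 1 + (r - l - 1)\bigl(1 - (\lambda - 1)^{k-1}/\lambda^{k-1}\bigr) \tag{6}$$

$$r_k = l_k + S_k$$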
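The agreement between the iterative rule (each newly inserted child takes 1/λ of whatever scope remains) and the closed forms $S_k = (r - l - 1)(\lambda - 1)^{k-1}/\lambda^k$ and $l_k = l + 1 + (r - l - 1)(1 - ((\lambda - 1)/\lambda)^{k-1})$ can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, λ is assumed to be an integer, and exact fractions stand in for whatever integer rounding an implementation would apply.

```python
from fractions import Fraction


def subrange_sizes(l, r, lam, count):
    """Sizes S_1..S_count obtained by repeatedly taking 1/lam of what is left."""
    remaining = Fraction(r - l - 1)
    sizes = []
    for _ in range(count):
        s = remaining / lam
        sizes.append(s)
        remaining -= s
    return sizes


def closed_form_size(l, r, lam, k):
    """S_k = (r - l - 1) * (lam - 1)^(k-1) / lam^k."""
    return Fraction(r - l - 1) * Fraction(lam - 1) ** (k - 1) / Fraction(lam) ** k


def left_bound(l, r, lam, k):
    """l_k = l + 1 + (r - l - 1) * (1 - ((lam - 1) / lam)^(k-1))."""
    ratio = Fraction(lam - 1, lam)
    return l + 1 + (r - l - 1) * (1 - ratio ** (k - 1))


# With lam = 2, every child gets half of what the previous child got.
halves = subrange_sizes(0, 1025, 2, 4)

# With lam = 100, the first children get about 1%, 0.99%, 0.98%, ... of the range.
percents = [float(s) / 100 for s in subrange_sizes(0, 10001, 100, 5)]
```

The left bound of the k-th child equals l + 1 plus the sizes of the k - 1 children allocated before it, which is what makes the subranges contiguous.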

Based on the above discussion, dynamic scope is defined as follows. The dynamic
scope of a node is a triple <n, size, k>, where k is the number of subscopes allocated inside
the current scope. Let the dynamic scopes of x and y be $s_x = \langle n_x, \mathit{size}_x, k_x \rangle$ and
$s_y = \langle n_y, \mathit{size}_y, k_y \rangle$, respectively. Node y is a descendant of x if $s_y \subset s_x$, that is, if
$[n_y, n_y + \mathit{size}_y) \subset [n_x, n_x + \mathit{size}_x)$.
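The descendant test reduces to an interval containment check, as the following minimal sketch shows. The Scope type and the is_descendant function are hypothetical names; the sample triples are the ones that appear in the insertion example of this description.

```python
from typing import NamedTuple


class Scope(NamedTuple):
    n: int      # label of the node, also the left end of its scope
    size: int   # the scope covers [n, n + size)
    k: int      # number of subscopes already allocated inside the scope


def is_descendant(sy, sx):
    """True if [sy.n, sy.n + sy.size) is properly contained in sx's interval."""
    inside = sx.n <= sy.n and sy.n + sy.size <= sx.n + sx.size
    return inside and (sy.n, sy.size) != (sx.n, sx.size)


s_p = Scope(1, 5120, 2)       # scope of (S, P)
l_ps = Scope(2561, 1280, 1)   # newly inserted (L, PS)
v2_psl = Scope(2562, 640, 0)  # newly inserted (v2, PSL)
old_l_ps = Scope(4, 640, 1)   # earlier (L, PS) entry
```

Only two integer comparisons are needed per test, and no tree has to be traversed, which is what allows the suffix tree to remain virtual.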

Let $T = t_1, \ldots, t_k$ be a sequence. Each $t_i$ corresponds to a node in the suffix tree.
Assume the size of the dynamically allocated scopes decreases on average by a factor of γ
every time we descend from a parent node to a child node. As a result, the size of $t_i$'s
scope comes to $\mathit{Max}/\gamma^{i-1}$, where Max is the size of the root node's scope. Evidently, for
a large enough i, $\mathit{Max}/\gamma^{i-1} \to 0$. This problem is known as scope underflow.
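To give a feel for how quickly underflow can set in, the sketch below computes the first position i at which an integer scope of size Max/γ^(i-1) rounds down to zero. It assumes an integer shrink factor γ that applies uniformly at every level, which is a simplification of the "on average" behavior described above; the function name is hypothetical.

```python
def underflow_depth(max_scope, gamma):
    """Smallest i for which floor(max_scope / gamma**(i - 1)) equals 0.

    Position i = 1 is the root, whose scope has size max_scope.
    gamma is assumed to be an integer >= 2.
    """
    i = 1
    size = max_scope
    while size > 0:
        size //= gamma   # descend one level
        i += 1
    return i


depth_8_bytes = underflow_depth(2**64 - 1, 2)
depth_small = underflow_depth(20480, 2)
```

Under these assumptions, an 8-byte scope that halves at each level is exhausted after 64 descents, which is why the length of the derived sequences has to be kept in check.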

As we have mentioned, XML databases such as DBLP and IMDB are composed
of records of small structures. For databases with large structures, such as XMARK, we
break down the structure into small substructures and create an index for each of them.
Thus, we limit the average length of the derived sequences.
If scope underflow still occurs for a given sequence $T = t_1, \ldots, t_k$ at $t_i$, we allocate
a subscope of size $k - i + 1$ from node $t_{i-1}$ and label each element $t_i, \ldots, t_k$ sequentially. If
node $t_{i-1}$ cannot spare a subscope of size $k - i + 1$, we allocate a subscope of size $k - i + 2$
from node $t_{i-2}$, and so forth. Intuitively, we borrow scopes from the parent nodes to solve
the scope underflow problems for the descendant nodes. To do this, we preserve a certain
amount of scope in each node for this unexpected situation, such that it does not interfere
with the dynamic labeling process, as described in greater detail above. Using this
method, the involved nodes are labeled sequentially (each node is allocated a scope for
only one child), and they cannot be shared with other sequences. However, they are still
properly indexed for matching.
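The search for an ancestor that can lend scope can be sketched as follows. The sketch assumes that each node $t_j$ records how much scope it holds in reserve, and that borrowing from $t_j$ requires k - j labels (one for each of $t_{j+1}, \ldots, t_k$), which matches the sizes k - i + 1 and k - i + 2 given above for $t_{i-1}$ and $t_{i-2}$. The reserve list and the function name are hypothetical.

```python
def find_lender(reserve, i, k):
    """Find the nearest ancestor of t_i that can spare enough reserved scope.

    reserve[j] is the scope held in reserve by node t_j (1-based; index 0 unused).
    Underflow occurred at t_i in a sequence of length k.
    Returns (j, needed) for the first t_j, walking upward from t_(i-1),
    whose reserve covers the k - j labels needed, or None if no node can.
    """
    for j in range(i - 1, 0, -1):
        needed = k - j
        if reserve[j] >= needed:
            return j, needed
    return None


# Hypothetical reserves for a sequence of length 6 that underflows at t_5.
lender_far = find_lender([0, 100, 3, 0, 1, 0, 0], i=5, k=6)
lender_near = find_lender([0, 100, 3, 0, 2, 0, 0], i=5, k=6)
```

The farther up the lender is, the more nodes have to be relabeled sequentially and the fewer nodes can be shared with other sequences.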

The dynamic labeling algorithm and the index construction algorithm of ViST 
will now be presented. ViST uses the same sequence matching algorithm as RIST, as 
described in greater detail above. 

The algorithm below outlines the top-down dynamic range allocation method 
described above. The labeling is based on a virtual suffix tree, which means it is not 
materialized. 
Input:  p: parent scope
        e: symbol for which a subscope is to be created
Output: s, a subscope inside the parent scope p
        p, updated parent scope

Assume p = <n, size, k>;
if semantic/statistical clues for e are available then
    Assume e is the i-th symbol in the follow set of e's parent node;
    s ← <l_i, S_i, 0>;    /* l_i and S_i are defined in Eq(4) and Eq(3) respectively */
else
    s ← <l_k, S_k, 0>;    /* l_k and S_k are defined in Eq(6) and Eq(5) respectively */
end
p ← <n, size, k + 1>;
return s;

We use an example to demonstrate the process of inserting a structure-encoded
sequence into the index structure. Suppose, before the insertion, the index structure
already contains the following sequence:

Doc1 = (P, ε)(S, P)(N, PS)(v1, PSN)(L, PS)(v2, PSL)

The sequence to be inserted is

Doc2 = (P, ε)(S, P)(L, PS)(v2, PSL)

The index before the insertion of Doc2 is shown in Figure 9(a). For presentation
simplicity, we make two assumptions: 1) Max = 20480, that is, the root node covers a
scope of [0, 20480); and 2) there are no semantic/statistical clues available and the
top-down dynamic scope allocation method uses a fixed parameter λ = 2 for all nodes.

The insertion process is much like that of inserting a sequence into a suffix tree.
That is, we follow the branches, and when there is no branch to follow, we create one.
We start with node (P, ε), and then (S, P), which has scope <1, 5120, 1>. Next, we search
in the S-Ancestor B+Tree of (L, PS) for all entries that are within the scope of [2, 5120).
The only entry there, <4, 640, 1>, is evidently not an immediate child of <1, 5120, 1>. As
a result, we insert a new entry <2561, 1280, 1>, which is the 2nd child of (S, P), in the
S-Ancestor B+Tree of (L, PS). The scope for the (S, P) node is updated to <1, 5120, 2>, as
it has a new child now. Similarly, when we reach (v2, PSL), we insert a new entry
<2562, 640, 0>. Finally, we insert key 2562 into the DocId B+Tree for Doc2. The resulting
index is shown in Figure 9(b).

The algorithm below details the process of inserting an XML sequence into the
index structure.

Input:  T: a structure-encoded sequence
        id: ID of the XML document represented by T
Output: updated index file F

Assume T = (a_1, l_1), ..., (a_k, l_k);
s ← <0, Max, k>;    /* s is the scope of the root node of the virtual suffix tree */
i ← 1;
while i ≤ k do
    Search key (a_i, l_i) in the D-Ancestor B+Tree;
    if found then
        e ← the S-Ancestor B+Tree associated with (a_i, l_i);
    else
        e ← new S-Ancestor B+Tree;
        Insert e into the D-Ancestor B+Tree with key (a_i, l_i);
    end
    Search in e for scope r such that r is an immediate child scope of s;
    if not found then
        r ← <n, size, k> ← subScope(s, a_i);
        Insert <n, size> into S-Ancestor B+Tree e with n as key;
    end
    s ← r;
    i ← i + 1;
end
Assume s = <n, size, k>;
Insert (n, id) into the DocId B+Tree;
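The insertion procedure can be sketched in a few dozen lines. This is an illustration under several assumptions, not the described implementation: plain dictionaries stand in for the D-Ancestor and S-Ancestor B+Trees, a list stands in for the DocId B+Tree, only the clue-free allocation with a fixed λ is covered, an immediate child is recognized by checking the left bounds of the slots already allocated in the parent, and scope underflow is reported rather than resolved by borrowing. The names Index and child_slot are hypothetical. Because the sketch applies the (r - l - 1) arithmetic with integer rounding, the labels it produces are not meant to reproduce the exact numbers shown in Figure 9.

```python
LAM = 2       # fixed allocation parameter (lambda)
MAX = 20480   # size of the root scope


def child_slot(n, size, j, lam=LAM):
    """Left bound and size of the j-th child of scope <n, size, .>.

    Integer version of the uniform rule: each child takes 1/lam of what is left.
    """
    left, remaining = n + 1, size - 1
    for _ in range(j - 1):
        taken = remaining // lam
        left += taken
        remaining -= taken
    return left, remaining // lam


class Index:
    def __init__(self):
        self.root = [0, MAX, 0]   # scope <n, size, k> of the virtual root
        self.d_ancestor = {}      # (symbol, prefix) -> {n: [n, size, k]}
        self.doc_ids = []         # (n, document id) pairs

    def insert(self, seq, doc_id):
        s = self.root
        for key in seq:
            e = self.d_ancestor.setdefault(key, {})
            # An immediate child of s starts at one of s's k allocated slots.
            r = None
            for j in range(1, s[2] + 1):
                left, _ = child_slot(s[0], s[1], j)
                if left in e:
                    r = e[left]
                    break
            if r is None:
                left, size = child_slot(s[0], s[1], s[2] + 1)
                if size == 0:
                    raise OverflowError("scope underflow (borrowing not sketched)")
                r = [left, size, 0]
                e[left] = r
                s[2] += 1         # the parent now has one more child
            s = r
        self.doc_ids.append((s[0], doc_id))
        return s[0]


doc1 = [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")]
doc2 = [("P", ""), ("S", "P"), ("L", "PS"), ("v2", "PSL")]

idx = Index()
n1 = idx.insert(doc1, 1)
n2 = idx.insert(doc2, 2)
```

As in the worked example, inserting the second sequence reuses the existing (P, ε) and (S, P) nodes and creates new entries only for (L, PS) and (v2, PSL), so the S-Ancestor structure of (L, PS) ends up with two entries and the k value of (S, P) becomes 2.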






In conclusion, ViST, a dynamic indexing method for XML documents, has
been described in detail herein. We convert XML data, as well as structured XML
queries, to sequences that encode their structural information. Efficient sequence
matching algorithms are introduced to find XML documents that contain the structured
queries. Currently known XML indexing methods have difficulty in handling
queries containing branches, insofar as most of them first disassemble a structured query
into multiple sub-queries, each handling a single path in the structured query, and then
join the results of the sub-queries to provide the final answers. ViST, in contrast, uses structures as
the basic unit of query, which enables us to process structured queries as a whole through
sequence matching and, as a result, to avoid expensive join operations. In
addition, ViST supports dynamic insertion of XML documents through the top-down
scope allocation method. Finally, the index structure of ViST is entirely based on
B+Trees, which, unlike some specialized data structures used in other approaches, are
well supported by DBMSs.

The particular embodiments disclosed above are illustrative only, as the invention
may be modified and practiced in different but equivalent manners apparent to those
skilled in the art having the benefit of the teachings herein. Furthermore, no limitations
are intended to the details of construction or design herein shown, other than as described
in the claims below. It is therefore evident that the particular embodiments disclosed
above may be altered or modified and all such variations are considered within the scope
and spirit of the invention. Accordingly, the protection sought herein is as set forth in the
claims below.